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The resolution of dynamics in out of equilibrium quantum spin systems relies at the 
heart of fundamental questions among Quantum Information Processing, Statistical 
Mechanics and Nano- Technologies. Efficient computational simulations of interact- 
ing many-spin systems are extremely valuable tools for tackling such questions. Here, 
we use the Trotter-Suzuki (TS) algorithm, a well-known strategy that provides the 
evolution of quantum systems, to address the spin dynamics. We present a GPU 
implementation of a particular TS version, which has been previously implemented 
on single cores in CPUs. We develop a massive parallel version of this algorithm 
and compare the efficiency between CPU and GPU implementations. This boosted 
method reduces the execution time in several hundred times and is capable to deal 
with systems of up to 27 spins (only limited by the GPU memory). 



* dente@famaf.unc.edu. ar 



2 



I. INTRODUCTION 

The evolution of a generic quantum spin system is described by an appropriate Schrodinger 
equation, where the Hamiltonian operator H encloses all the information about its dynam- 
ics, including external fields and spin-spin interactions. The "academic" strategy involves 
the solution of an eigen-problem for H and the state vectors [1]. This can be a major 
computational task, since the dimensionality of the underlying Hilbert space scales expo- 
nentially with the number N of interacting spins in the system. In practice, such scenario 
constrains the problem to the treatment of few-spins systems, or dealing with special cases 
where symmetries provide further simplifications [2]. 

As alternatives to the full matrix diagonalization, we recall strategies such as the Lanc- 
zos algorithm [3-5] and the Trotter- Suzuki (TS) algorithm [6, 7]. The Lanczos algorithm 
is usually employed in more sophisticated and well-known methods like the Density Matrix 
Renormalization Group [8, 9], which is a truncation of the Hilbert space to the specific 
energy excitations. However, it may be inadequate for dealing with high-temperature quan- 
tum systems, which is the case of spin dynamics (SD) relevant for Magnetic Resonance 
(MR) [10]. Recently, an algorithm based on the state space restriction approximation has 
been proposed for such a scenario [11]. This method relies on beforehand assumptions about 
spin environments, that provide a physically grounded Hilbert space truncation. 

The TS algorithm divides the Hamiltonian operator into a sum of diagonal and non- 
diagonal terms. A full unitary evolution is built by successive (partial) short-time evolution 
operators, each arising from the Hamiltonian terms. Provided that the time steps are short 
enough, the exact evolution is well approximated by the successive partial evolutions. This 
method has been implemented on general purpose Graphics Processing Units (GPU) to 
simulate single particle dynamics when the Schrodinger equation is reduced into the tight- 
binding approximation [12, 13]. 

In this article we report the implementation of a 4th order TS decomposition for SD in 
GPU. We consider generic spin Hamiltonians without any Hilbert space truncation or any 
further assumption about symmetries or ad hoc dynamical conditions. Our implementation 
is based on the XYZ decomposition [14] , in which the Hamiltonian is partitioned in the X, 
Y and Z components of the interactions bilinear in spin. Since the Z component can be 
written in a simple diagonal form, it only modifies the phases of the states. Moreover, X 
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and Y components are intrinsically non-diagonal, and thus local rotations are performed in 
order to map them into Z-like terms. 

The Z-like terms and the local rotations are implemented by means of a massive par- 
allelization scheme. The speedup obtained goes beyond 700 times compared with a single 
CPU core implementation. Despite this decrease in the execution time, the system size is 
still bounded by the maximum amount of memory available, which is about 6GB for current- 
generation GPUs. Thus, our implementation is capable to reach a system of 27 spins, which 
is still a significant number when compared to other methods. 

The paper is organized as follows. In Sec. II we summarize the basis of the TS algorithm. 
In Sec. Ill we describe several implementations in GPUs, their performance and accuracy. 
We include also in A a brief discussion on two scenarios where this computational strategy 
can be employed: MR spin ensemble calculations, where the present work is up-to-date 
being exploited, and the potential use in the evolution of Matrix Product States. Sec. IV 
concludes the present work. 

II. THE TROTTER-SUZUKI ALGORITHM 

In this section we summarize the TS method for spin systems as presented in Ref. [7]. 
We start considering the following total spin Hamiltonian: 

N N 

ff = EE^l+E E J h s ? s t, (!) 

j=la=x,y,z j,k=la=x,y,z 

where is the spin operator at site j on axis a — x,y, z. The parameters define the 
local fields or chemical shifts, with the constrain (h?) = 0, while the J" k are the coupling 
between pair of spins. 

We assume that the A-spin system is initially described by a state vector \^o), which is 
expanded in the computational Ising basis: 

|*o) = i>|A). (2) 
i=i 

Here, q are complex coefficients and \/3i) are tensor products of eigenvectors of SJ op- 
erators, typically referred as the S z - decoupled or Ising basis. In A we discuss two relevant 
cases in which pure states, as in Eq. 2, are employed to evaluate the dynamics in complex 
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many-body systems. In fact, we stress here that the present computational strategy can be 
substantially exploited when combined with sophisticated physically-based representations. 

The state of the system at an arbitrary time t is formally given by |\P t ) = U(£) l^o) = 
e -itH/n |\j/ )_ p or simplicity we set from now on h — 1. 

The main idea of the TS method relies on finding an appropriate partition of H which 
may yield a set of simple partial evolutions that approximate the exact one U(t). If we 
consider Eq. 1 as a particular sum of terms H — Hi + ... + H K , then: 



K 

U(i) = e~ itH = e -W 1+ ...+H K ) — ij m J TT 
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The first order approximation to U(i) is 

U(f) - Ui(t) = e- itffl ...e- itifif , (3) 
while the second and fourth order approximations can be expressed in terms of the first: 

U 2 (t) = U t 1 (-t/2)U 1 (t/2), 

U 4 (t) = U 2 (at)U 2 (at)U 2 ((l - 4a)t)U 2 (at)U 2 (at), 

where a = 1/(4 — 4 1 / 3 ). These are indeed satisfactory approximations to U(t) provided that 
t must be small enough compared to the maximum local time-scale determined by H . Such 
role is played by the local second moment of the interactions. This means that the partial 
evolution time step t must satisfy: 
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la=x,y,z N - a=x,y,z 

Now we address specifically how to build an operator Ui(t) given by Eq. 3 from the total 
Hamiltonian of Eq. 1. The natural choice for the partition sum are the single-spin and the 
two-spin terms, which shall be mapped into a diagonal representation by suitable rotations. 
In particular, we consider the operators that rotate X and Y axis into the Z axis: 
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and satisfy: 
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{K,2)'s x Rl /2 = s*, 
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These rotation operators are introduced to simplify the numerical computation, since 
the whole calculations are performed in the computational Ising basis, as mentioned above. 
For instance, the partial evolution exp [— ith" Sj~\ yields a trivial phase, while other cases 
shall be properly rotated using R^/2j anc ^ R-w/2j- The j-index labels each spin, and the 
corresponding global rotations are defined by Y = (g^ R^/2j an d X = Rl n / 2 j- 

Let us first consider the single-spin operations, 



exp —it 
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(6) 



a=x,y,z 

Here, we stress that the approximate equality relies on the validity of Eq. 4. Non-trivial 
exponentials are rotated: 



exp 



exp 
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(7) 



Even though this kind of single-spin operations can in fact be computed exactly without 
rotations and without the approximation of Eq. 6, our purpose is to write all the partial 
evolution operators in an explicit diagonal form. This strategy will be specifically exploited 
in the computational implementation, as shown in Section III. 

In analogy with Eq. 6, the two-spin operators yield: 



exp 
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(8) 



Once again, the Z-terms yield diagonal operators: 
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The remaining two-spin terms shall be rotated by Y and X accordingly, 



exp —it 



exp —it 



TV 
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Y exp —it 



Xexp —it 




(10) 



Notice that, from Eqs. 7 and 10, the rotations for single and double spin terms can be 
performed simultaneously. 

III. IMPLEMENTATION AND RESULTS 

Any state written in the form of Eq. 2 can be evolved by the successive application of 
partial evolution operators, as those defined in Sec. II. At any time, the state vector is 
represented by a double precision complex array that occupies 2 N x 16 bytes of memory. 
Its kth element stores the probability amplitude Ck for the corresponding basis (Ising) state. 
These basis states are tensor products of up and down configurations for each spin, and 
can be written according to the iV-bit binary representation of k. Naturally, the size of the 
system is constrained by the maximum amount of memory available in the GPU (in our 
case, 6Gb for a Tesla C2070). 

The implementation of the full evolution is divided into two main modules: phase- 
corrections and axis-rotations. 

Forward Rotation — > Phases in Z — > Backward Rotation 
A. Phase correction 



The first module performs the phase corrections depending on the Z-projection (up or 
down) and position of each spin in each state of the Ising basis (which are 2 N states). In this 
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stage there are no cross terms between different base states (all operations are diagonal), 
and hence the parallelization is trivial. However, due to the two-spin terms, the time of 
evaluation of each phase correction increases quadratically with the number of spins. 

If the Hamiltonian is independent of time, then phase corrections can be computed pre- 
viously, i.e. before performing the sequence of partial evolutions. In fact, Eq. 9 ensures that 
the phases only depend on the time step and the Hamiltonian parameters. The disadvantage 
of such pre-computed phase corrections relies on the amount of memory required, which is 
equivalent to that of the state vector. Thus, it increases the memory usage by 150%, limiting 
simulations to 27 spins on 6GB GPUs instead of 28, but reduces phase correction time by 
97%. 

Additionally, since every phase correction is followed by a backward rotation that also 
operates on the whole state, the phase correction kernel is merged with the backward rota- 
tion kernel. This means that phase correction is performed when the rotation kernel reads 
elements from memory and just before applying the rotation. This saves two global memory 
operations on each element and a kernel call. 

Combined, these improvements reduce the overall simulation time by up to 52%. 

B. Rotation 

The rotation module acts over a pair of basis states by means of the operators defined in 
Eq. 5. If a particular basis state has the j-th spin down, then it is paired to the basis state 
that has the j-th spin up and the same configuration for the rest of spins: 

n j,i — ■ ■ ■ -Ij ■ ■ ■ 

%t = ...t, ... = n hi + V. 

(11) 

This pairing procedure must range the whole Hilbert space. Hence, for every value of j 
the pairing must be repeated over all the 2 N ~ l states that have the sought configuration in 
the j-th spin. 

In order to show how the parallelization can be performed at this stage, let us exemplify 
one case of the pairing procedure. If we address the rotation of the second spin of a four-spin 
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system, then pairs of states under consideration are the following: 



III! lilt Utl lltt till tilt tltl 



and 



tltt 
tttt 




It!! Itlt Itt! Ittt ttl! ttlt ttt! 



From this example it is straightforward clear that there are no repetitions in any of the 
basis states when evaluating one specific rotation (for a particular j). Thus, each rotation 
operation within the same j-spin results independent and can be parallelized. 



Since each rotation is independent (fixed j), we initially wrote a GPU kernel that rotates 
the j-th spin on a single pair and launched 2 Ar ~ 1 threads, one per pair. However, as each 
individual rotation is computationally cheap, the kernel is limited by memory bandwidth to 
feed each thread with its pair of states. To avoid this bottleneck, we aim to read each of 
these states from memory and perform as many rotations as is possible before writing them 
back to memory. 

If we take any subset of spins F G V({1..N}) and fix their positions to {fi\i G F}, then 
the set of all states that satisfy these restrictions {s = s si..s n \Vi G F : = fa} is closed 
under the state pairing function that flips any spin that is not in F. This allows us to 
launch 2' F I thread blocks, and set the spin positions fa using the binary encoding of each 
thread block's unique index. This leaves each thread block with its own set of 2 N ~\ F \ states 
to rotate and, by its construction, every state's peers for spins {1..N} — F are also within 
the set. Thus, each thread block copy its states to the fast shared memory available in each 
multiprocessor within a GPU, and perform the rotations for all the spins in {1--N} — F 
without performing any further global memory accesses. Once these rotations are done, the 
results are copied back to global memory and the kernel ends. 

While this technique improves rotation times, the approach has a problem: when rotating 
higher spins, the lower bits of the state indices handled by a thread block are given by the 
block index, which results in adjacent states being interleaved across different thread blocks. 
For example, when rotating spins 4 and 5 in a single kernel call on a 5-spin problem, the 



set of states of thread block would be {00000 fe = 0; 01000 fe = 8; 10000 6 = 16; 11000 6 = 24}, 
thread block 1 would be {00001 6 = 1; 01001 6 = 9; 10001 6 = 17; 11001 6 = 25}, and so on. As 
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FIG. 1. Rotation of spins 4 through 9 using both multiple rotation kernels. The clear bits are 
set by the thread block index, while the striped and checkered spins are set within each block. 
Checkered spins are rotated in shared memory. 



GPU memory controllers are built for fetching lines of contiguous data, currently 128 bytes 
long, this interleaving wastes memory bandwidth significantly. 

To solve this problem, when rotating over higher spins we shift thread block indices 
a number of bits s to the left, and each thread block must contain the states for every 
combination of these s lower bits. While this means that the kernel rotates s fewer spins per 
run, it also arranges states into groups of 2 s contiguous elements in memory. Continuing 
with the previous example, if we had a shift of one element then thread block 1 would 
handle states {00010 fe = 2;00011 fe = 3; 01010 fe = 10; 01011 6 = 11; 10010 6 = 18; 10011 6 = 
19; HOlOf, = 26; 11011 fe = 27}. The improved memory bandwidth efficiency results in a 
further improvement of about 300% over the previous kernel for s = 3, which matches the 
memory controller's 128 bytes line length. 

As we measured rotation times for different values of s and different thread block sizes, we 
noticed that there are not optimal values that work for every rotation and every system size. 
Thus, we implement a minimization algorithm which searches the optimal value for each 
system size and records such configuration. This recorded setup is used in every running of 
the TS algorithm which uses the same GPU card. 



D. Speedup 

As a result of this implementation we plot in Fig. 2 the execution times for the CPU 
and each of the three GPU versions. At this point it is necessary to stress that the CPU 
version is not so tuned as the GPU implementations. However, it is shown with the purpose 
to remark the overall improvement obtained with the use of the GPU. 
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FIG. 2. Execution time (in logarithmic scale) for CPU and three GPU implementations of the 
Trotter-Suzuki method as a function of the number of spins. Each point correspond to the evolution 
with 50 Trotter-Suzuki steps. For low number of spins (N <= 5) the CPU- version has better 
performance than the GPU, while for N > 5 the GPU versions dominates showing an increase of 
performance that exceeds more than one hundred times the CPU version 

From Fig. 2, for small systems (N <= 5), it is observed that the CPU is faster than 
any GPU implementation. This behavior is consistent with the fact that for these sizes 
the problem fits into the cache memory of the CPU processor. Thus, the high memory 
bandwidth makes the algorithm to run faster than any GPU version. Additionally, the 
smooth growth observed in Fig. 2 for iV < 15 in the GPU versions, indicates that the GPU 
does not fill its full capability. Hence, the maximum efficiency in the GPU is not reached 
and the difference between the CPU and GPU progressively grows. 

In the opposite case, when N is considerably large, the execution time of the GPU imple- 
mentation increases exponentially. The inflection point around N ~ 14 indicates that the 
GPU reaches its full capacity, and afterwards each GPU version scales as a N , with a factor 
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a higher than 2. In Fig. 3-a and b we plot the speedup of selected implementations. The 
a parameter can be obtained by fitting linearly each curve for N > 14. We note that a is 
higher than 2 and the exact value depends on the algorithm version. For the CPU case we 
obtain a = 2.20, meanwhile for the GPU-Basic case a = 2.11 and for the GPU-Optimized 
version a = 2.09. These small differences between the implementations are reflected in the 
behavior of the speedup. As it is shown in Fig. 3, the speedup grows with the system size, 
evidencing the differences of the a parameters. 

It is worthy to stress the performance obtained with the GPU implementations, one of 
them without optimization at all (GPU-Basic) and the other a fully optimized version. In 
the first case, we observe a speedup of 73 times as compared to the CPU version (we consider 
N = 14 for reference). In the second case, i.e. the optimized one, the speedup reaches almost 
290 times. In terms of spin systems, this allows the evaluation of systems which have 8 extra 
spins in the same computational time. 

E. Accuracy 

The last issue to be addressed concerns to the accuracy of our TS implementation. Several 
comparisons were performed between the present method and exact diagonalization schemes 
[15-17], i.e. the solution of the Schrodinger equation as an eigenproblem. The evolution 
of local and non-local physical magnitudes was compared, for one-dimensional spin systems 
described by Hamiltonians in the form of Eq. 1. For N = 16 [18], a sequence of 5 x 10 6 
TS steps yields a relative difference bounded by 10 -8 . A second accuracy test was provided 
by the spurious total magnetization observed in polarization-conserving Hamiltonians. In 
fact, this a strictly physical condition: if the initial state has zero total magnetization, then 
the evolved one should remain so. In general, we observe a linear increase of the total 
magnetization of the spin system, which is consistent with the general expectancies of a 
linear increase of the TS error as function of the number of steps. In particular, for N = 22 
and 5 x 10 6 TS steps, we observe a relative deviation less than 10~ 6 , a precision which is 
good enough for practical purposes. 

An alternative strategy for error quantification is provided by the Loschmidt echo (LE) 
[19]. It evaluates the reversibility of a system's dynamics in the presence of uncontrolled 
degrees of freedom [20]. For a particular initial state |\& ), the standard LE definition [21] 
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FIG. 3. Speedup between the implementations, a: CPU vs GPU-Optimized. For large spin 
systems the GPU is about 800 times faster than the CPU. b: Optimized GPU version versus the 
basic implementation in the GPU. It is observed an increase greater than 6 times for long spin 
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is: 

M LE (2t) = | (%\ exp [i(H + E)t] exp [-iH t] \%) | 2 , (12) 

where Hq is a reversed Hamiltonian, and X encloses uncontrolled non-reversed degrees of 
freedom. In the present case, we set £ = 0, which for an ideal perfectly accurate computation 
would yield M LE = 1, Vt. In principle, this statement is completely irrespective of the TS 
approximation and its intrinsic accuracy (i.e. the validity of Eq. 4). As a matter of fact, 
exactly the same sequences of TS partial evolutions are applied in the forward and backwards 
dynamics, except for the change in the sign of t. Therefore, the deviations of Mle away 
from the unity could be intrinsically originated in the execution of floating point operations 
by the specific Hardware. 

We consider initial states given by random superpositions of Ising states (e.g. Eq. A2), for 
a A = 16 spin set. Two cases of different dynamical complexity are evaluated. The first case 
is a ring configuration described by a nearest neighbors Heisenberg Hamiltonian. As above, 
this interaction conserves total magnetization, and then the dynamics does not explore the 
whole Hilbert space. For this case, 5 x 10 5 TS steps yield |1 - M LE \ = 5.206647 x 1(T 8 , 
while 5 x 10 6 TS steps yield |1 - M LE \ = 5.2066504 x 10~ 7 . Again, the error appears to 
increase linearly with the number of TS steps. However, we stress here that these deviations 
cannot be associated to the TS decomposition. The second case is built from the first, 
adding double quantum terms SjS% — S^S^ up to third next nearest neighbors. In this 
situation, total magnetization is not conserved, and therefore dynamics effectively mixes all 
spin projection subspaces exploring the whole Hilbert space. It turns out that 5 x 10 6 TS 
steps yield |1 — M LE \ = 5.2094531 x 10~ 7 , which is slightly larger than the first case. This 
may indicate that errors in computational operations can depend on the complexity of the 
many-body dynamics. 

Once again, these examples evidence a satisfactory accuracy of our TS-GPU implemen- 
tation. A systematic study of the LE as an error quantifier is well beyond the scope of this 
article. This would require ranging over N and the number of TS steps, different Hamilto- 
nian complexities and initial states, among many other factors. However, the LE turns to 
be a promising witness to address computational errors in GPU and CPU implementations. 
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IV. CONCLUSIONS 

We presented here a GPU implementation that boosted the Trotter-Suzuki method for 
quantum spin dynamics. We developed a parallelization scheme to exploit the massive paral- 
lel architecture of the GPU cards. The results showed a significant increase of performance 
(which can reach more than one hundred times) as compared with the single CPU core 
implementation. 

Additionally, we showed several upgrades in the GPU versions that yield an extra speedup 
of 6 times for large spin systems. The benefits enabled by this massive parallel hardware, 
boosted the capability of evaluating the dynamics of considerably large quantum spin sys- 
tems. In particular, we were able to evolve a maximum of 27 spins (limited only by the 
GPU memory) in reasonable execution times. 

The comparison between our Trotter-Suzuki implementation with exact numerical ap- 
proaches yielded estimated relative errors, which turned to be fairly acceptable within the 
standard expectancies of this kind of computational strategy. 

The implementation of this algorithm and the efficiency achieved, open promising oppor- 
tunities for studying fundamental questions within the field of out-of-equilibrium quantum 
many-body systems. 
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Appendix A: Spin systems described by single pure states 

We will briefly mention here two cases where an interacting many-spin system S can be 
described or approximately described by, a pure state: the random superposition (entangled) 
states for high temperature systems, and the so-called Matrix Product States. These are 
suited candidates for our GPU-boosted TS implementation. 
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In the first case, S is characterized by the infinite temperature limit, as it is often the 
situation in Magnetic Resonance spin dynamics. Strictly speaking, S cannot be described 
by a pure (single vector) state as in Eq. 2, but by a highly mixed state, typically denoted 
by a density matrix. This represents a whole probabilistic ensemble, and contains all the 
statistical information about S [22]. The manipulation of the density matrix may rapidly be 
cumbersome due to memory requirements, as soon as its dimension scales as 2 N x 2 N . But, 
provided the observables to be evaluated are local (they involve just a few spins within S), 
one can use just a few pure entangled states to compute ensemble-averaged quantities [23]. 
This procedure enables a physical parallelization which relies on the quantum superposition 
principle. A simple case may be an initial state given by a single spin up-polarized (localized 
excitation), and the rest of them in the high-temperature thermal equilibrium, represented 
by a mixture of all states with amplitudes satisfying the appropriate statistical weights and 
random phases: 



where, analogously to Eq. 2, \(3j) are the states of the computational Ising basis. This 
case has been employed to evaluate specific time-dependent correlation functions for spin 
systems [20, 24, 25], avoiding the storage, manipulation and diagonalization of overwhelm- 
ingly large matrices. Most importantly, it is nowadays employed to address the problem of 
thermalization in closed quantum systems, being assisted with our GPU implementation. 

The second case, refers to Matrix Product States [26, 27], which constitute a set of states 
that succesfully approximates the exact state of S in many physical situations. They are 
intimately related to renormalization group methods[9], and have proved very useful to deal 
with dynamical observables in one-dimensional quantum spin systems[28]. 

For a chain of N spins, a Matrix Product State is given by: 



where A\ is a D- dimensional matrix, and ijv) represents an Ising state. The time- 

evolution of this kind of states is performed by a suitable TS algorithm[28], and therefore 
they are promising candidates for TS-GPU implementations like the one we present in this 




ifj = random phase, 
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